Distributing power to heterogeneous compute elements of a processor

ABSTRACT

In one embodiment, the present invention includes a processor having a first domain with a first compute engine and a second domain with a second compute engine, where each of these domains can operate at an independent voltage and frequency. A first logic may be present to update a power bias value used to control dynamic allocation of power between the first and second domains based at least in part on a busyness of the second domain. In turn, a second logic may dynamically allocate at least a portion of a power budget for the processor between the domains based at least in part on this power bias value. Other embodiments are described and claimed.

This application is a continuation of U.S. patent application Ser. No. 13/621,478, filed Sep. 17, 2012, the content of which is hereby incorporated by reference.

BACKGROUND

As technology advances in the semiconductor field, devices such as processors incorporate ever-increasing amounts of circuitry. Over time, processor designs have evolved from a collection of independent integrated circuits (ICs), to a single integrated circuit, to multicore processors that include multiple processor cores within a single IC package. As time goes on, ever greater numbers of cores and related circuitry are being incorporated into processors and other semiconductors.

Multicore processors are being extended to include additional functionality by incorporation of other functional units within the processor. One issue that arises is that the different circuitry can consume differing amounts of power based on their workloads. However, suitable mechanisms to ensure that these different units have sufficient power do not presently exist.

For example, a processor including different compute elements limits the amount of total power consumed to a level called the thermal design power (TDP) limit. In addition to a configured TDP limit for a processor, an original equipment manufacturer (OEM) may limit the TDP of the processor even lower, to enable longer battery life, for different form factors, etc. When running power hungry workloads on these systems, the available power (up to the TDP limit) is split between different compute elements. The manner in which the power is split affects performance of the system. The current approach to this power distribution problem is to use a fixed ratio for all workloads and TDPs, meaning that a certain portion of the power is allocated to the different units of the processor. However, this approach is not optimal for all workloads.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram of a method for controlling a power bias between different domains of a processor in accordance with an embodiment of the present invention.

FIG. 2 is a block diagram of a portion of a processor in accordance with an embodiment of the present invention.

FIG. 3 is a block diagram of a processor in accordance with an embodiment of the present invention.

FIG. 4 is a block diagram of a multi-domain processor in accordance with another embodiment of the present invention.

FIG. 5 is a block diagram of an embodiment of a processor including multiple cores.

FIG. 6 is a block diagram of a system in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

In various embodiments, a power bias technique may be provided and used in allocation of a power budget of a processor including multiple domains. In addition, a power bias value itself can be dynamically updated during run time. As used herein the term “domain” is used to mean a collection of hardware and/or logic that operates at the same voltage and frequency point. As an example, a multicore processor can further include other non-core processing engines such as fixed function units, graphics engines, and so forth. Other computing elements can include digital signal processors, processor communications interconnects (buses, rings, etc.), and network processors. A processor can include multiple independent domains, including a first domain associated with the cores (referred to herein as a core or central processing unit (CPU) domain) and a second domain associated with a graphics engine (referred to herein as a graphics or a graphics processing unit (GPU) domain). Although many implementations of a multi-domain processor can be formed on a single semiconductor die, other implementations can be realized by a multi-chip package in which different domains can be present on different semiconductor die of a single package. Further embodiments may be applicable for balancing power between computing elements of a single system made up of many individual chip packages. For example, a CPU on one package and a GPU on a different package with a total system power limit of 50 watts can be managed as described herein.

In a multi-domain processor, the multiple domains collectively share a single power budget. Accordingly, the higher the frequency at which, e.g., the core domain is operating, the higher the power consumed by the core domain. And in turn, the higher the power consumed by the core domain, the less power is left for the graphics domain to consume, and vice versa.

For many applications executed on a processor, the core domain may act as a producer that generates workload data to be executed by the graphics domain, which thus acts as a consumer. For example, for many applications, such as a 3 dimensional (3D)-based application, the processor may act to access data from a memory, write commands and instructions into the memory, and provide data to the graphics domain for performing graphics operations such as various shading, rendering, and other operations to thus generate pixel data for display on an associated display.

In such applications, an intelligent bias control (IBC) in accordance with an embodiment of the present invention may dynamically adjust the distribution of power between the graphics domain and the core domain based on workload demand. Embodiments may be particularly appropriate for low power environments, such as where a processor is operating at a thermal limit such as a thermal design power (TDP) level (or an even lower level such as set by an OEM). In certain processors, a predetermined value of power sharing between multiple domains may be set, e.g., as part of a configuration of the system such as by a basic input/output system (BIOS). Although such a setting may be appropriate for many workloads, for certain applications, particularly where the processor is configured to be limited to operation below the TDP limit, e.g., for purposes of power management or so forth, performance can be impacted. Instead, using IBC in accordance with an embodiment of the present invention, a power bias value, which can be used to control allocation of power between different domains, can itself be dynamically controlled based on workload.

More specifically, embodiments may monitor operation of various domains, including a core domain, a graphics domain and an interconnect domain, in order to determine if one or more of these domains needs more or less power than it is currently receiving. If the core and interconnect domains require less power than they are currently receiving in order to maintain the graphics domain fully busy, power distribution may be biased more toward the graphics domain, e.g., by adjustment to this power bias value. Instead, if the core domain and/or interconnect domain need more power than they are currently receiving to maintain the graphics domain fully occupied, power distribution may be biased more towards the core domain and/or the interconnect domain, e.g., by opposite control of the power bias value.

IBC in accordance with an embodiment of the present invention may generally operate as follows. First, an amount of work a given application is creating for the core, graphics, and interconnect domains can be measured. If it is determined based on these measurements that the graphics domain is idle, e.g., for greater than a given threshold amount of time during an evaluation interval, power balance may be moved toward the core domain, e.g., by control of the power bias value. Instead, if the graphics domain is being fully utilized during the evaluation interval, the power balance may be moved toward the graphics domain.
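As a rough illustration only, the high-level rule just described can be sketched in Python. The function name, the idle-fraction input, and both threshold values are assumptions made for this sketch; the text does not specify them.

```python
def bias_direction(gpu_idle_fraction, idle_threshold=0.20, busy_threshold=0.02):
    """Suggest which domain the power balance should move toward.

    gpu_idle_fraction: share of the evaluation interval (0.0 to 1.0) during
    which the graphics domain was idle. Both thresholds are hypothetical.
    """
    if not 0.0 <= gpu_idle_fraction <= 1.0:
        raise ValueError("gpu_idle_fraction must be between 0.0 and 1.0")
    if gpu_idle_fraction > idle_threshold:
        # Graphics is starved of work: give the producer (core) more power.
        return "core"
    if gpu_idle_fraction <= busy_threshold:
        # Graphics is effectively fully utilized: give it more power.
        return "graphics"
    # In between, leave the balance where it is.
    return "hold"
```

Under these assumed thresholds, a graphics domain that is idle half the time would shift power toward the core domain, while one that is never idle would shift power toward itself.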

For ease of discussion, embodiments described herein are with regard to a multi-domain processor including a core domain and a graphics domain that can share a power budget. However, understand that the scope of the present invention is not limited in this regard and additional domains can be present. As another example, each core can be allocated to a different domain and each of the domains can be provided with a dynamically re-partitionable amount of a power budget. Furthermore, in addition to core domains and graphics domains, understand that additional domains can be present. For example, another domain can be formed of other processing units such as fixed function units, accelerators or so forth. And a still further domain can be provided for certain management agents of a processor, which can receive a fixed portion of a total power budget.

Note that embodiments to perform intelligent bias control as described herein may be independent of operating system (OS)-based power management. For example, according to an OS-based mechanism, namely the Advanced Configuration and Power Interface (ACPI) standard (e.g., Rev. 3.0b, published Oct. 10, 2006), a processor can operate at various performance states or levels, namely from P0 to PN. In general, the P1 performance state may correspond to the highest guaranteed performance state that can be requested by an OS. In addition to this P1 state, the OS can further request a higher performance state, namely a P0 state. This P0 state may thus be an opportunistic state in which, when power and/or thermal budget is available, processor hardware can configure the processor or at least portions thereof to operate at a higher than guaranteed frequency. In many implementations a processor can include multiple so-called bin frequencies above this P1 frequency. In addition, according to ACPI, a processor can operate at various power states or levels. With regard to power states, ACPI specifies different power consumption states, generally referred to as C-states, C0, C1 to Cn states. When a core is active, it runs at a C0 state, and when the core is idle it may be placed in a core low power state, also called a core non-zero C-state (e.g., C1-C6 states). When all cores of a multicore processor are in a core low power state, the processor can be placed in a package low power state, such as a package C6 low power state.

Referring now to FIG. 1, shown is a flow diagram of a method for controlling a power bias between different domains of a processor in accordance with an embodiment of the present invention. As shown in FIG. 1, method 100 can be implemented in various hardware, software and/or firmware in different embodiments. For example, in an embodiment method 100 may be implemented in power sharing logic of a power controller of a processor such as a power control unit (PCU). In another embodiment, method 100 can be implemented in a device driver such as a kernel mode driver (KMD), e.g., for a graphics domain. Note that the embodiment described in FIG. 1 is with regard to analysis and control of power bias between a CPU domain and a graphics domain. Although described as sharing power between these particular domains in the embodiment of FIG. 1, understand the scope of the present invention is not limited in this regard and in other embodiments, dynamic power sharing may be between other types of domains having heterogeneous compute elements.

As seen in FIG. 1, portions of method 100 (generally through diamond 150) may be performed for each frame of graphics rendered during an evaluation interval. As seen, method 100 begins by determining a target frequency for a CPU domain and an interconnect domain (block 110). In the embodiment shown, this interconnect domain may have a ring-based interconnect, details of which will be discussed further below. In an embodiment, the determinations of these target frequencies may be based on certain metrics received from various locations of the processor. In general, these metrics may correspond to a busyness of the graphics domain and a busyness of the interconnect domain. That is, because typically the graphics domain is a consumer of data processed by the CPU domain (and which is thus the producer domain), using the graphics domain busyness (and the ring domain busyness) as a proxy may enable determination of an appropriate target frequency for the CPU domain (and the interconnect domain). In the illustrated embodiment, the busyness can be measured based on activity levels of various components or locations within a micro-architecture of the graphics domain and the interconnect domain. Or in another embodiment, the busyness can be determined based on a utilization rate of these components.

Still referring to FIG. 1, control next passes to block 120 where a target frequency can be set as an upper limit on the CPU domain frequency and interconnect domain frequency for a next evaluation interval. In an embodiment, this setting may be made by storing the target frequency in a maximum frequency storage, such as a configuration register, e.g., present in or associated with the PCU. A single target frequency may be set, or independent target frequencies for the CPU and interconnect domains may be set.

As will be described below, these domains may be controlled in the next evaluation interval (e.g., by the PCU) to operate at or lower than this target frequency. Note that on a determination to lower the target frequency (as when the CPU/interconnect are not being fully utilized), the amount of reduction may be by a first amount (e.g., 200 MHz), in an embodiment, to gradually decrease the target frequency. Instead, on a determination to raise the target frequency (as when the CPU/interconnect are fully utilized), the amount of increase may be by a second, higher amount (e.g., 500 MHz) in an embodiment, to more rapidly increase the target frequency.
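The asymmetric stepping described above (a slow decrease and a faster increase) might be sketched as follows. The 200 MHz and 500 MHz steps come from the example values in the text; the function name and the minimum and maximum frequency bounds are hypothetical and serve only to keep the sketch well defined.

```python
STEP_DOWN_MHZ = 200  # example "first amount" from the text
STEP_UP_MHZ = 500    # example "second, higher amount" from the text


def next_target_frequency(current_mhz, fully_utilized,
                          min_mhz=800, max_mhz=3500):
    """Return the target (upper-limit) frequency for the next interval.

    min_mhz and max_mhz are assumed platform limits, not values from the text.
    """
    step = STEP_UP_MHZ if fully_utilized else -STEP_DOWN_MHZ
    # Clamp so the target never leaves the assumed supported range.
    return max(min_mhz, min(max_mhz, current_mhz + step))
```

Starting from 2000 MHz, a fully utilized CPU/interconnect would raise the target to 2500 MHz, while an under-utilized one would lower it to 1800 MHz.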

Control next passes to diamond 130 where it can be determined whether the CPU has been running within a guard band of this target frequency (e.g., during the current evaluation interval). Note that the guard band thus provides a measure of filtering of short frequency changes, e.g., due to brief spikes and dips in processor utilization, as well as performance and power state changes (respectively P-state and C-state changes). Thus as a result of such possible effects, a lower than requested CPU frequency over the time interval under analysis may occur. By using a guard band to guide power bias updates, increasing the power bias value too far towards the CPU domain based on such normal rhythms of CPU frequency can be prevented. In an embodiment, the guard band may be a predetermined percentage of the target frequency, e.g., 5-10%, although the scope of the present invention is not limited in this regard.
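A minimal sketch of the guard band test of diamond 130 is given below, assuming the guard band is expressed as a fraction of the target frequency and that only a shortfall below the target matters. The function name and the 5% default (the low end of the example range) are choices made for this illustration.

```python
def within_guard_band(actual_mhz, target_mhz, guard_band=0.05):
    """Return True if the observed CPU frequency is no more than
    guard_band (a fraction of the target) below the target frequency."""
    if target_mhz <= 0:
        raise ValueError("target_mhz must be positive")
    return actual_mhz >= target_mhz * (1.0 - guard_band)
```

With a 2000 MHz target and a 5% guard band, an average of 1950 MHz would count as within the band, whereas 1800 MHz would indicate that the CPU is running too far below its target.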

If it is determined that the CPU has been running within this guard band, control passes to block 140 where a counter associated with a graphics domain can be incremented. More specifically, a graphics bias hysteresis counter, which may be a counter located in a logic of the processor, may be incremented. Otherwise, if the CPU frequency is lower than the target frequency by greater than the guard band, this is an indication that the CPU frequency is too low, and thus sufficient throughput from the CPU to the graphics domain is not being realized. Accordingly, control passes to block 135 where a CPU counter can be incremented, which may be another counter located in a logic of the processor. More specifically, a CPU bias hysteresis counter can be incremented. Note that by providing a measure of hysteresis by way of these counters, bias changes can be smoothed out. In this way, one-time anomalous frame occurrences do not move the bias too far in one direction, which would cause an overall performance slowdown.
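The per-frame counter updates of blocks 135 and 140 could be modeled as in the sketch below. The class and field names are hypothetical, the guard band is again assumed to be a fraction of the target frequency, and a frame counter is included as an optional convenience.

```python
from dataclasses import dataclass


@dataclass
class BiasCounters:
    """Hypothetical container for the two hysteresis counters."""
    graphics: int = 0  # frames suggesting more graphics power (block 140)
    cpu: int = 0       # frames suggesting more CPU power (block 135)
    frames: int = 0    # total frames analyzed

    def record_frame(self, actual_mhz, target_mhz, guard_band=0.05):
        """Classify one rendered frame and bump the matching counter."""
        self.frames += 1
        if actual_mhz >= target_mhz * (1.0 - guard_band):
            # CPU ran close to its target: bias evidence favors graphics.
            self.graphics += 1
        else:
            # CPU ran well below its target: it needs more power.
            self.cpu += 1
```

Because a bias change is only considered after several frames have been counted, a single anomalous frame adds one to a counter without moving the bias by itself.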

Although not shown, in some embodiments a frame counter also may be incremented. This counter may thus provide a count of the number of frames rendered (and thus the number of loops through this portion of method 100). Then, from both of blocks 135 and 140, control passes to diamond 150 where it can be determined whether sufficient time has elapsed since the last power bias change. Although the scope of the present invention is not limited in this regard, in one embodiment this threshold time period or time window may be on the order of approximately 250 milliseconds (ms). Note that this time window-based analysis may be used to handle applications having different frame rates. When power sharing in accordance with an embodiment of the present invention is tuned to optimally run with low frame rate applications, it may result in frame rate oscillations in high frame rate applications due to the bias overshooting the optimum value. Instead, by using a constant value time window, the rate of bias change may be normalized to be independent of the frame rate of a given application under execution.
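One way to picture the time window check of diamond 150 is the small sketch below. It uses integer millisecond timestamps supplied by the caller; the class name, method names, and this timestamp convention are assumptions for illustration, while the 250 ms default follows the example in the text.

```python
class BiasChangeWindow:
    """Tracks whether enough time has passed since the last bias change."""

    def __init__(self, window_ms=250, start_ms=0):
        self.window_ms = window_ms
        self.last_change_ms = start_ms

    def ready(self, now_ms):
        """True once at least window_ms has elapsed since the last change."""
        return now_ms - self.last_change_ms >= self.window_ms

    def mark_changed(self, now_ms):
        """Record that the power bias value was just updated."""
        self.last_change_ms = now_ms
```

Since the check depends only on elapsed time, an application rendering 30 frames per second and one rendering 120 frames per second would both see at most one bias update per window.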

Referring still to FIG. 1, if sufficient time has elapsed since the last power bias change, control passes to diamond 160 where it can be determined whether the GPU bias hysteresis counter has exceeded a first threshold level. Although the scope of the present invention is not limited in this regard, in an embodiment, this first threshold level may be set at a given count value, which in some embodiments may be set at two. Also, it can be determined whether the number of frames for which a determination of greater GPU power is made (which in an embodiment can equal the GPU bias counter value) exceeds a first threshold percentage of frames. In an embodiment the percentage may be between approximately 55 and 65% of the number of frames analyzed. Thus if both the GPU bias counter exceeds the threshold level and greater than the predetermined percentage of analyzed frames indicates that greater graphics domain power is needed, control passes to block 165 where a power bias value can be increased towards the graphics domain. Although described with these multiple determinations, in other embodiments only a single one of these determinations may be performed (e.g., a comparison of a count to a threshold or a percentage of frames indicating a greater power consumption level is desired).

A variable amount of increase to this power bias value may be provided in certain embodiments. However, for ease of implementation, in other embodiments a fixed value of the increase may be implemented. In a particular embodiment, the increase may be in terms of percentage and may correspond to, e.g., between approximately 1-2%. Note that this power bias value may be a configuration register, e.g., present in the PCU that enables the PCU to dynamically allocate power budget between different domains. In some embodiments, as the graphics domain is the primary consumer of CPU domain activity, this bias value can be set to an initial level of between approximately 80-90% in favor of the graphics domain, meaning that of the power budget allocated between the graphics domain and the CPU domain, the given percentage corresponding to the power bias value may be allocated to the graphics domain.
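To make the meaning of the bias value concrete, the following sketch splits a shared budget according to a bias percentage and applies a fixed-size step to that percentage. The function names, the wattage in the usage note, and the clamping to the 0-100% range are assumptions; the text gives only the approximate 80-90% initial level and 1-2% step.

```python
def split_budget(shared_budget_w, graphics_bias_pct):
    """Split a shared power budget (watts) between graphics and CPU.

    graphics_bias_pct is the share, in percent, given to the graphics domain.
    Returns (graphics_watts, cpu_watts).
    """
    if not 0.0 <= graphics_bias_pct <= 100.0:
        raise ValueError("graphics_bias_pct must be between 0 and 100")
    graphics_w = shared_budget_w * graphics_bias_pct / 100.0
    return graphics_w, shared_budget_w - graphics_w


def step_bias(graphics_bias_pct, step_pct):
    """Move the bias by step_pct (positive favors graphics), clamped to 0-100."""
    return max(0.0, min(100.0, graphics_bias_pct + step_pct))
```

For instance, with a hypothetical 20 W shared budget and an 85% bias, the graphics domain would receive 17 W and the CPU domain 3 W; a 1.5% step toward graphics would then raise the bias to 86.5%.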

If instead at diamond 160 it is determined that both the GPU bias hysteresis counter does not exceed the GPU (first) threshold and the number of frames seeking greater GPU power does not exceed the first threshold number of frames, control passes to diamond 170 where it can be determined whether the CPU bias hysteresis counter has exceeded a second threshold level (which in different embodiments may be the same or different value than the first threshold level) and whether the number of frames for which a determination of greater CPU power is made (which in an embodiment can equal the CPU bias counter value) exceeds a second threshold percentage of frames (which in different embodiments may be the same or different than the first threshold percentage of frames). If both these determinations are in the positive, control passes to block 175 where a power bias value can be increased towards the core domain. Otherwise, the method concludes for this set of frames without any adjustment to the power bias value. Although shown at this high level in the embodiment of FIG. 1, understand the scope of the present invention is not limited in this regard. For example, in other embodiments some of the techniques to provide a measure of hysteresis or reduced adjustment to the power bias value may not be present. Thus in other embodiments one or more of the hysteresis counters, the guard band analysis, the time window determination, and the percentage of frames analysis may be eliminated.

As an example of the determinations, assume 11 frames have elapsed, of which six indicated a need for more CPU power and five indicated less (and thus the CPU bias counter equals six and the GPU bias counter equals five). Also assume both bias counter thresholds are set at two and the percentage thresholds are set at 60%. In this scenario, no change to the bias value occurs, despite the fact that both hysteresis counters passed the two-frame threshold, because neither counter accounts for more than 60% of the frames (six of 11 is approximately 55% and five of 11 is approximately 45%). This is done to reduce oscillation around an optimal bias point.
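The combined determinations of diamonds 160 and 170 can be illustrated with the sketch below, which reproduces the 11-frame example. It is a simplification under stated assumptions: the function name is hypothetical, the same count and percentage thresholds are used for both domains (the text allows them to differ), and the result is a direction label rather than a register update.

```python
def decide_bias_update(gpu_count, cpu_count, frames,
                       count_threshold=2, pct_threshold=60.0):
    """Return 'graphics', 'core', or 'none' for the next bias adjustment."""
    if frames <= 0:
        return "none"
    gpu_pct = 100.0 * gpu_count / frames
    cpu_pct = 100.0 * cpu_count / frames
    # Diamond 160: both the count and the percentage test must pass.
    if gpu_count > count_threshold and gpu_pct > pct_threshold:
        return "graphics"
    # Diamond 170: the same two tests for the CPU counter.
    if cpu_count > count_threshold and cpu_pct > pct_threshold:
        return "core"
    # Neither side has a clear majority: leave the bias unchanged.
    return "none"


# The worked example: 11 frames, six for the CPU and five for graphics.
print(decide_bias_update(gpu_count=5, cpu_count=6, frames=11))  # prints "none"
```

Both counters exceed two, but six of 11 frames is roughly 55%, so neither percentage test passes and the bias is left alone.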

Referring now to FIG. 2, shown is a block diagram of a portion of a processor in accordance with an embodiment of the present invention. As shown in FIG. 2, processor 200 may include a kernel mode driver 210. More specifically, this driver may be configured as a device driver for a graphics domain to thus interface the graphics domain to an OS executing on the processor. As seen, driver 210 may include a power conservation logic 215 which may include various components to analyze different metrics of processor performance (more particularly with regard to the graphics domain) and to enable power conservation when possible.

In the embodiment shown, power conservation logic 215 may include a dynamic frequency logic 216 that may be used to determine optimal core and interconnect frequencies based on a workload being performed by the graphics domain. Thus as seen, logic 216 may output upper limit values for the frequency of both the core domain and the interconnect domain.

As seen in the embodiment of FIG. 2, these values may be provided to a PCU 230 which may include various hardware, software and/or firmware to perform power management operations, responsive to inputs from power conservation logic 215 as well as based on other inputs, such as various modes of operation as instructed, e.g., by the OS.

As further seen in FIG. 2, power conservation logic 215 may further include an intelligent bias control (IBC) logic 218 in accordance with an embodiment of the present invention. In general, logic 218 may operate according to the algorithm of FIG. 1 above to control a power bias value, which can be communicated to PCU 230. This power bias value thus indicates to the PCU how a power budget shared between the core domain and the graphics domain is to be divided.

Referring still to FIG. 2, note that the operations performed within power conservation logic, including the frequency analysis performed by dynamic frequency logic 216 and the power bias analysis performed by IBC logic 218, may be performed responsive to various inputs. More specifically as seen in FIG. 2, power conservation logic 215 (and thus its constituent logics 216 and 218) may receive inputs from a number of locations within the processor. In the illustrated embodiment, these locations may include core machine specific registers (MSRs) 220, uncore or system agent MSRs 222, PCU registers 224, which in an embodiment can be implemented using memory mapped IO (MMIO) registers, and graphics registers 226.

Various types of information may be received, including, for example, various counter values from core MSR 220. In one embodiment, these counter values may include a timestamp counter (TSC) value, as well as other time-based counter values such as an actual count (ACNT) and a maximum count (MCNT). For example, a ratio of these values can indicate the average frequency over an evaluation interval. Various platform information such as memory utilization, L3 cache utilization/hit/miss counters, etc., may be received from system agent MSR 222 and used for estimating ring busyness. In turn, information received from PCU register 224 may include various fused values, such as maximum frequencies of operation and so forth. In addition, current frequencies at which various domains such as the CPU, graphics and interconnect domains are operating, may also be provided. Finally, GPU registers 226 may provide activity counts and other information indicative of busyness of the graphics domain. All of this data about actual platform behavior/busyness can then be used by the IBC logic to calculate optimum power bias values.
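As one possible reading of how a ratio of the ACNT and MCNT values can indicate average frequency, the sketch below assumes that ACNT advances at the actual clock rate and MCNT advances at a fixed reference rate, so that the ratio of their deltas scales the reference frequency. That counting behavior, the function name, and the reference frequency parameter are assumptions of this sketch and are not stated in the text.

```python
def average_frequency_mhz(acnt_start, acnt_end, mcnt_start, mcnt_end,
                          reference_mhz):
    """Estimate the average frequency over an evaluation interval.

    Assumes ACNT ticks at the actual frequency and MCNT ticks at
    reference_mhz; counter wraparound is not handled.
    """
    mcnt_delta = mcnt_end - mcnt_start
    if mcnt_delta <= 0:
        raise ValueError("MCNT must advance during the interval")
    acnt_delta = acnt_end - acnt_start
    return reference_mhz * acnt_delta / mcnt_delta
```

Under these assumptions, if ACNT advanced by 1500 counts while MCNT advanced by 2000 counts against a 2400 MHz reference, the estimated average frequency would be 1800 MHz.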

With regard to PCU 230, based on the upper limits identified for the CPU domain and interconnect domain, and the requested power bias (corresponding to a determined power bias value), an appropriate mix of power to be allocated to the core domain and the graphics domain can be determined. Based on this determination, PCU 230 may set an appropriate voltage and frequency combination for these independent domains. Thus as shown in FIG. 2, PCU 230 may output voltage and frequency values. In turn, these values may be used, e.g., by internal control logic of the domains, clock generators, voltage regulators, and so forth, to operate at the instructed levels. Although shown at this high level in the embodiment of FIG. 2, understand the scope of the present invention is not limited in this regard. That is, while the example shown in FIG. 2 is with regard to IBC being performed within a kernel mode driver, understand that in other embodiments such control can be implemented within the PCU itself or as logic gates in other locations within a processor.

Referring now to FIG. 3, shown is a block diagram of a processor in accordance with an embodiment of the present invention. As shown in FIG. 3, processor 300 may be a multicore processor including a plurality of cores 310a-310n in a core domain 310. In one embodiment, each such core may be of an independent power domain and can be configured to operate at an independent voltage and/or frequency, and to enter turbo mode when available headroom exists, or the cores can be uniformly controlled as a single domain. As further shown in FIG. 3, one or more GPUs 312₀-312ₙ may be present in a graphics domain 312. Each of these independent graphics engines also may be configured to operate at independent voltage and/or frequency or may be controlled together as a single domain. These various compute elements may be coupled via an interconnect 315 to a system agent or uncore 320 that includes various components. As seen, the uncore 320 may include a shared cache 330 which may be a last level cache. In addition, the uncore may include an integrated memory controller 340, various interfaces 350 and a power control unit 355.

In various embodiments, power control unit 355 may include a power sharing logic 359, which may be a logic to perform dynamic control and re-allocation of an available power budget between multiple independent domains of the processor. In the embodiment of FIG. 3, power sharing logic 359 may include an IBC logic 357 to dynamically set a power bias value between core domain 310 and graphics domain 312, e.g., based on the busyness of these components as well as the busyness of interconnect 315. To this end, PCU 355 may include various registers or other storages, both to store a power bias value as well as to store an upper limit on core frequency and interconnect frequency as determined by the IBC logic or other such logic of the PCU. Although shown at this location in the embodiment of FIG. 3, understand that the scope of the present invention is not limited in this regard and the storage of this logic can be in other locations.

With further reference to FIG. 3, processor 300 may communicate with a system memory 360, e.g., via a memory bus. In addition, by interfaces 350, connection can be made to various off-chip components such as peripheral devices, mass storage and so forth. While shown with this particular implementation in the embodiment of FIG. 3, the scope of the present invention is not limited in this regard.

Referring now to FIG. 4, shown is a block diagram of a multi-domain processor in accordance with another embodiment of the present invention. As shown in the embodiment of FIG. 4, processor 400 includes multiple domains. Specifically, a core domain 410 can include a plurality of cores 410₀-410ₙ, a graphics domain 420 can include one or more graphics engines, and a system agent domain 450 may further be present. In various embodiments, system agent domain 450 may remain powered on at all times to handle power control events and power management such that domains 410 and 420 can be controlled to dynamically enter into and exit low power states. In addition, these domains can dynamically share a power budget between them based at least in part on a power bias value determined in accordance with an embodiment of the present invention. Each of domains 410 and 420 may operate at different voltage and/or power.

Note that while only shown with three domains, understand the scope of the present invention is not limited in this regard and additional domains can be present in other embodiments. For example, multiple core domains may be present, each including at least one core. In this way, finer grained control of the number of processor cores that can be executing at a given frequency can be realized.

In general, each core 410 may further include low level caches in addition to various execution units and additional processing elements. In turn, the various cores may be coupled to each other and to a shared cache memory formed of a plurality of units of a last level cache (LLC) 440₀-440ₙ. In various embodiments, LLC 440 may be shared amongst the cores and the graphics engine, as well as various media processing circuitry. As seen, a ring interconnect 430 thus couples the cores together, and provides interconnection between the cores, graphics domain 420 and system agent circuitry 450.

In the embodiment of FIG. 4, system agent domain 450 may include display controller 452 which may provide control of and an interface to an associated display. As further seen, system agent domain 450 may include a power control unit 455 which can include a power sharing logic 459 in accordance with an embodiment of the present invention. In various embodiments, this logic may execute an algorithm such as shown in FIG. 1 to thus dynamically share an available package power budget between the core domain and the graphics domain.

As further seen in FIG. 4, processor 400 can further include an integrated memory controller (IMC) 470 that can provide for an interface to a system memory, such as a dynamic random access memory (DRAM). Multiple interfaces 480₀-480ₙ may be present to enable interconnection between the processor and other circuitry. For example, in one embodiment at least one direct media interface (DMI) interface may be provided as well as one or more Peripheral Component Interconnect Express (PCI Express™ (PCIe™)) interfaces. Still further, to provide for communications between other agents such as additional processors or other circuitry, one or more interfaces in accordance with an Intel® Quick Path Interconnect (QPI) protocol may also be provided. Although shown at this high level in the embodiment of FIG. 4, understand the scope of the present invention is not limited in this regard.

Referring to FIG. 5, an embodiment of a processor including multiple cores is illustrated. Processor 1100 includes any processor or processing device, such as a microprocessor, an embedded processor, a digital signal processor (DSP), a network processor, a handheld processor, an application processor, a co-processor, a system on a chip (SOC), or other device to execute code. Processor 1100, in one embodiment, includes at least two cores, cores 1101 and 1102, which may include asymmetric cores or symmetric cores (the illustrated embodiment). However, processor 1100 may include any number of processing elements that may be symmetric or asymmetric.

In one embodiment, a processing element refers to hardware or logic to support a software thread. Examples of hardware processing elements include: a thread unit, a thread slot, a thread, a process unit, a context, a context unit, a logical processor, a hardware thread, a core, and/or any other element, which is capable of holding a state for a processor, such as an execution state or architectural state. In other words, a processing element, in one embodiment, refers to any hardware capable of being independently associated with code, such as a software thread, operating system, application, or other code. A physical processor typically refers to an integrated circuit, which potentially includes any number of other processing elements, such as cores or hardware threads.

A core often refers to logic located on an integrated circuit capable of maintaining an independent architectural state, wherein each independently maintained architectural state is associated with at least some dedicated execution resources. In contrast to cores, a hardware thread typically refers to any logic located on an integrated circuit capable of maintaining an independent architectural state, wherein the independently maintained architectural states share access to execution resources. As can be seen, when certain resources are shared and others are dedicated to an architectural state, the line between the nomenclature of a hardware thread and core overlaps. Yet often, a core and a hardware thread are viewed by an operating system as individual logical processors, where the operating system is able to individually schedule operations on each logical processor.

Physical processor 1100, as illustrated in FIG. 5, includes two cores, cores 1101 and 1102. Here, cores 1101 and 1102 are considered symmetric cores, i.e., cores with the same configurations, functional units, and/or logic. In another embodiment, core 1101 includes an out-of-order processor core, while core 1102 includes an in-order processor core. However, cores 1101 and 1102 may be individually selected from any type of core, such as a native core, a software managed core, a core adapted to execute a native instruction set architecture (ISA), a core adapted to execute a translated ISA, a co-designed core, or other known core. Yet to further the discussion, the functional units illustrated in core 1101 are described in further detail below, as the units in core 1102 operate in a similar manner.

As depicted, core 1101 includes two hardware threads 1101a and 1101b, which may also be referred to as hardware thread slots 1101a and 1101b. Therefore, software entities, such as an operating system, in one embodiment potentially view processor 1100 as four separate processors, i.e., four logical processors or processing elements capable of executing four software threads concurrently. As alluded to above, a first thread is associated with architecture state registers 1101a, a second thread is associated with architecture state registers 1101b, a third thread may be associated with architecture state registers 1102a, and a fourth thread may be associated with architecture state registers 1102b. Here, each of the architecture state registers (1101a, 1101b, 1102a, and 1102b) may be referred to as processing elements, thread slots, or thread units, as described above. As illustrated, architecture state registers 1101a are replicated in architecture state registers 1101b, so individual architecture states/contexts are capable of being stored for logical processor 1101a and logical processor 1101b. In core 1101, other smaller resources, such as instruction pointers and renaming logic in allocator and renamer block 1130 may also be replicated for threads 1101a and 1101b. Some resources, such as re-order buffers in reorder/retirement unit 1135, I-TLB 1120, load/store buffers, and queues may be shared through partitioning. Other resources, such as general purpose internal registers, page-table base register(s), low-level data-cache and data-TLB 1115, execution unit(s) 1140, and portions of out-of-order unit 1135 are potentially fully shared.
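
The way an operating system might count logical processors in this arrangement can be illustrated with a short Python sketch. The function name and the letter-suffix labeling are assumptions chosen to mirror the reference numerals used above; this is not an actual enumeration interface.

```python
def logical_processors(core_ids, threads_per_core):
    """List one label per hardware thread slot, e.g. '1101a'.

    Hypothetical helper: each core contributes threads_per_core logical
    processors, labeled with a letter suffix to mirror the reference numerals.
    """
    suffixes = "abcdefghijklmnopqrstuvwxyz"
    if not 1 <= threads_per_core <= len(suffixes):
        raise ValueError("unsupported thread count")
    return [
        f"{core}{suffixes[slot]}"
        for core in core_ids
        for slot in range(threads_per_core)
    ]


# Two cores with two thread slots each yield four logical processors.
labels = logical_processors([1101, 1102], 2)
```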

Processor 1100 often includes other resources, which may be fully shared, shared through partitioning, or dedicated by/to processing elements. In FIG. 5, an embodiment of a purely exemplary processor with illustrative logical units/resources of a processor is illustrated. Note that a processor may include, or omit, any of these functional units, as well as include any other known functional units, logic, or firmware not depicted. As illustrated, core 1101 includes a simplified, representative out-of-order (OOO) processor core. But an in-order processor may be utilized in different embodiments. The OOO core includes a branch target buffer 1120 to predict branches to be executed/taken and an instruction-translation buffer (I-TLB) 1120 to store address translation entries for instructions.

Core 1101 further includes decode module 1125 coupled to fetch unit 1120 to decode fetched elements. Fetch logic, in one embodiment, includes individual sequencers associated with thread slots 1101a, 1101b, respectively. Usually core 1101 is associated with a first ISA, which defines/specifies instructions executable on processor 1100. Often machine code instructions that are part of the first ISA include a portion of the instruction (referred to as an opcode), which references/specifies an instruction or operation to be performed. Decode logic 1125 includes circuitry that recognizes these instructions from their opcodes and passes the decoded instructions on in the pipeline for processing as defined by the first ISA. For example, decoders 1125, in one embodiment, include logic designed or adapted to recognize specific instructions, such as a transactional instruction. As a result of the recognition by decoders 1125, the architecture or core 1101 takes specific, predefined actions to perform tasks associated with the appropriate instruction. It is important to note that any of the tasks, blocks, operations, and methods described herein may be performed in response to a single or multiple instructions; some of which may be new or old instructions.

In one example, allocator and renamer block 1130 includes an allocator to reserve resources, such as register files to store instruction processing results. However, threads 1101a and 1101b are potentially capable of out-of-order execution, where allocator and renamer block 1130 also reserves other resources, such as reorder buffers to track instruction results. Unit 1130 may also include a register renamer to rename program/instruction reference registers to other registers internal to processor 1100. Reorder/retirement unit 1135 includes components, such as the reorder buffers mentioned above, load buffers, and store buffers, to support out-of-order execution and later in-order retirement of instructions executed out-of-order.

Scheduler and execution unit(s) block 1140, in one embodiment, includes a scheduler unit to schedule instructions/operations on execution units. For example, a floating point instruction is scheduled on a port of an execution unit that has an available floating point execution unit. Register files associated with the execution units are also included to store instruction processing results. Exemplary execution units include a floating point execution unit, an integer execution unit, a jump execution unit, a load execution unit, a store execution unit, and other known execution units.

Lower level data cache and data translation buffer (D-TLB) 1150 are coupled to execution unit(s) 1140. The data cache is to store recently used/operated on elements, such as data operands, which are potentially held in memory coherency states. The D-TLB is to store recent virtual/linear to physical address translations. As a specific example, a processor may include a page table structure to break physical memory into a plurality of virtual pages.
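
The role of a translation buffer that holds recent virtual-to-physical translations can be sketched in Python as a small cache in front of a page table. The class name, the 4 KiB page size, the four-entry capacity, and the first-in-first-out eviction policy are all assumptions made for illustration; none of them is specified in this description.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class DataTLB:
    """Hypothetical model of a D-TLB caching recent address translations."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.capacity = capacity
        self.entries = {}  # insertion-ordered dict used as a FIFO of translations
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_addr):
        """Return the physical address for virtual_addr."""
        vpn, offset = divmod(virtual_addr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                # Evict the oldest cached translation (FIFO, an assumed policy).
                self.entries.pop(next(iter(self.entries)))
            # Fall back to the page table and remember the result.
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset


tlb = DataTLB({0: 5, 1: 9})
first = tlb.translate(0x10)   # miss: walks the page table
second = tlb.translate(0x20)  # hit: same virtual page as the first access
```

In this sketch the second access to the same virtual page is served from the cached entry, which is the behavior a translation buffer is meant to provide.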

Here, cores 1101 and 1102 share access to higher-level or further-out cache 1110, which is to cache recently fetched elements. Note that higher-level or further-out refers to cache levels increasing or getting further away from the execution unit(s). In one embodiment, higher-level cache 1110 is a last-level data cache (the last cache in the memory hierarchy on processor 1100), such as a second or third level data cache. However, higher level cache 1110 is not so limited, as it may be associated with or include an instruction cache. A trace cache (a type of instruction cache) instead may be coupled after decoder 1125 to store recently decoded traces.

In the depicted configuration, processor 1100 also includes bus interface module 1105 and a power controller 1160, which may perform power sharing control in accordance with an embodiment of the present invention. Historically, controller 1170 has been included in a computing system external to processor 1100. In this scenario, bus interface 1105 is to communicate with devices external to processor 1100, such as system memory 1175, a chipset (often including a memory controller hub to connect to memory 1175 and an I/O controller hub to connect peripheral devices), a memory controller hub, a northbridge, or other integrated circuit. And in this scenario, bus 1105 may include any known interconnect, such as a multi-drop bus, a point-to-point interconnect, a serial interconnect, a parallel bus, a coherent (e.g. cache coherent) bus, a layered protocol architecture, a differential bus, and a GTL bus.

Memory 1175 may be dedicated to processor 1100 or shared with other devices in a system. Common examples of types of memory 1175 include DRAM, SRAM, non-volatile memory (NV memory), and other known storage devices. Note that device 1180 may include a graphic accelerator, processor or card coupled to a memory controller hub, data storage coupled to an I/O controller hub, a wireless transceiver, a flash device, an audio controller, a network controller, or other known device.

Note however, that in the depicted embodiment, the controller 1170 is illustrated as part of processor 1100. Recently, as more logic and devices are being integrated on a single die, such as an SOC, each of these devices may be incorporated on processor 1100. For example, in one embodiment, memory controller hub 1170 is on the same package and/or die with processor 1100. Here, a portion of the core (an on-core portion) includes one or more controller(s) 1170 for interfacing with other devices such as memory 1175 or a graphics device 1180. The configuration including an interconnect and controllers for interfacing with such devices is often referred to as an on-core (or un-core) configuration. As an example, bus interface 1105 includes a ring interconnect with a memory controller for interfacing with memory 1175 and a graphics controller for interfacing with graphics processor 1180. Yet, in the SOC environment, even more devices, such as the network interface, co-processors, memory 1175, graphics processor 1180, and any other known computer devices/interface may be integrated on a single die or integrated circuit to provide small form factor with high functionality and low power consumption.

Embodiments may be implemented in many different system types. Referring now to FIG. 6, shown is a block diagram of a system in accordance with an embodiment of the present invention. As shown in FIG. 6, multiprocessor system 500 is a point-to-point interconnect system, and includes a first processor 570 and a second processor 580 coupled via a point-to-point interconnect 550. As shown in FIG. 6, each of processors 570 and 580 may be multicore processors, including first and second processor cores (i.e., processor cores 574a and 574b and processor cores 584a and 584b), although potentially many more cores may be present in the processors. Each of the processors can include a PCU or other logic to perform dynamic allocation of a package power budget between multiple domains of the processor based, at least in part, on a power bias value, as described herein.
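
One way such logic might maintain a power bias value is sketched below in Python, loosely following the counter scheme recited in the claims: a graphics-side counter advances when the core domain frequency is within a threshold of its target, a core-side counter advances otherwise, and the bias is revisited only after a set number of evaluation intervals. The class name, all numeric defaults, the step size, the reading of the bias as the graphics share of the budget, and the resetting of counters after each decision are assumptions for illustration; how the target frequency itself is derived from graphics and interconnect busyness is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class PowerBiasLogic:
    """Hypothetical model of a counter-driven power bias update."""

    graphics_bias: float = 0.5        # assumed: fraction of budget for graphics
    step: float = 0.05                # assumed adjustment granularity
    freq_tolerance_mhz: float = 100.0  # assumed "threshold of the target"
    core_threshold: int = 8           # assumed value for the first threshold
    graphics_threshold: int = 8       # assumed value for the second threshold
    update_period: int = 16           # assumed intervals between bias updates
    core_counter: int = 0
    graphics_counter: int = 0
    intervals_since_update: int = 0

    def evaluate(self, core_freq_mhz, target_freq_mhz):
        """Process one evaluation interval and return the current bias."""
        if abs(core_freq_mhz - target_freq_mhz) <= self.freq_tolerance_mhz:
            # Core domain is near its target: count toward the graphics side.
            self.graphics_counter += 1
        else:
            self.core_counter += 1
        self.intervals_since_update += 1
        if self.intervals_since_update >= self.update_period:
            if self.graphics_counter > self.graphics_threshold:
                self.graphics_bias = min(1.0, self.graphics_bias + self.step)
            elif self.core_counter > self.core_threshold:
                self.graphics_bias = max(0.0, self.graphics_bias - self.step)
            # Assumed: counters restart after each bias decision.
            self.core_counter = 0
            self.graphics_counter = 0
            self.intervals_since_update = 0
        return self.graphics_bias


logic = PowerBiasLogic()
for _ in range(16):
    bias = logic.evaluate(core_freq_mhz=1000.0, target_freq_mhz=1000.0)
```

Deferring the comparison until a fixed number of intervals has elapsed keeps the bias from reacting to a single interval, which is the apparent purpose of the threshold time period in the claim language.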

Still referring to FIG. 6, first processor 570 further includes a memory controller hub (MCH) 572 and point-to-point (P-P) interfaces 576 and 578. Similarly, second processor 580 includes an MCH 582 and P-P interfaces 586 and 588. As shown in FIG. 6, MCHs 572 and 582 couple the processors to respective memories, namely a memory 532 and a memory 534, which may be portions of system memory (e.g., DRAM) locally attached to the respective processors. First processor 570 and second processor 580 may be coupled to a chipset 590 via P-P interconnects 552 and 554, respectively. As shown in FIG. 6, chipset 590 includes P-P interfaces 594 and 598.

Furthermore, chipset 590 includes an interface 592 to couple chipset 590 with a high performance graphics engine 538, by a P-P interconnect 539. In turn, chipset 590 may be coupled to a first bus 516 via an interface 596. As shown in FIG. 6, various input/output (I/O) devices 514 may be coupled to first bus 516, along with a bus bridge 518 which couples first bus 516 to a second bus 520. Various devices may be coupled to second bus 520 including, for example, a keyboard/mouse 522, communication devices 526 and a data storage unit 528 such as a disk drive or other mass storage device which may include code 530, in one embodiment. Further, an audio I/O 524 may be coupled to second bus 520. Embodiments can be incorporated into other types of systems including mobile devices such as a smart cellular telephone, Ultrabook™, tablet computer, netbook, or so forth.

Embodiments may be implemented in code and may be stored on a non-transitory storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.

While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.

What is claimed is:
 1. A processor comprising: a first domain having a first compute engine; a second domain having a second compute engine, the first and second compute engines asymmetrical, each of the first and second domains to operate at an independent voltage and frequency; a first domain counter and a second domain counter, wherein one of the first and second domain counters is to be updated based on whether a frequency of the first domain is substantially around a target frequency for the first domain; a power bias logic to dynamically update a power bias value stored in a configuration register, the power bias value used to control dynamic allocation of power between the first and second domains based at least in part on a measure of a busyness of the second domain, the power bias value corresponding to a portion of a power budget for the processor to be allocated to one of the first and second domains, wherein the power bias logic is to dynamically update the power bias value based at least in part on one of the first domain counter and the second domain counter; and a power controller to dynamically allocate at least a portion of the power budget between the first and second domains based at least in part on the power bias value.
 2. The processor of claim 1, wherein the second domain is a consumer domain and the first domain is a producer domain, the first compute engine is a core, the core including a decode unit to decode instructions, at least one execution unit to execute decoded instructions, and a retirement unit to retire executed instructions; and the second compute engine is a graphics engine.
 3. The processor of claim 1, wherein the power bias logic is to determine the target frequency based on the measure of the busyness of the second domain.
 4. The processor of claim 3, wherein the power bias logic is to determine the target frequency further based on a measure of a busyness of an interconnect domain that couples the first and second domains.
 5. The processor of claim 1, wherein the power bias logic is to update the second domain counter when the frequency of the first domain is within a threshold of the target frequency and to update the first domain counter when the first domain frequency is not within the threshold of the target frequency.
 6. The processor of claim 5, wherein the power bias logic is to compare the second domain counter to a second threshold, if a threshold time period has occurred since a last update to the power bias value.
 7. The processor of claim 6, wherein the power bias logic is to adjust the power bias value in favor of the second domain when the second domain counter is greater than the second threshold.
 8. An apparatus comprising: a multicore processor having a first domain including a plurality of cores, a second domain including at least one graphics engine, and a third domain including system agent circuitry, the third domain to operate at a fixed power budget and including a power controller to control delivery of power to the first and second domains, and a power bias logic to dynamically determine a power bias value to indicate a bias of the power delivery as between the first and second domains responsive to a measure of a busyness of the second domain, a measure of a busyness of an interconnect domain that couples the first and second domains and a value of at least one of a first counter and a second counter and to communicate the power bias value to the power controller for storage in a configuration register, wherein the power bias logic is to update the second counter when a frequency of the first domain is within a threshold of the target frequency for the first domain and update the first counter when the first domain frequency is not within the threshold of the target frequency for the first domain.
 9. The apparatus of claim 8, wherein the power bias logic is to set the target frequency for the first domain based on the interconnect domain busyness and the second domain busyness.
 10. The apparatus of claim 9, wherein the power bias logic is to communicate the target frequency to the power controller, the power controller to store the target frequency in a maximum frequency register and to limit a frequency of the first domain to the target frequency.
 11. The apparatus of claim 8, wherein the power bias logic is to adjust the power bias value in favor of the second domain when the second counter is greater than a second threshold, and to adjust the power bias value in favor of the first domain when the first counter is greater than a first threshold.
 12. The apparatus of claim 11, wherein the power bias logic is to compare one of the first and second counters to the corresponding one of the first and second thresholds, if a threshold time period has occurred since a last update to the power bias value.
 13. The apparatus of claim 8, wherein the power controller is to control the power delivery to the first and second domains based at least in part on the power bias value.
 14. A non-transitory machine-readable medium having stored thereon instructions, which if performed by a machine cause the machine to perform a method comprising: determining, in a first logic of a processor, a target frequency for a core domain of the processor based on an activity level of a graphics domain of the processor and an activity level of an interconnect domain of the processor; determining whether the core domain has been operating within a threshold level of the target frequency during an evaluation interval; updating one of a first counter and a second counter based on the determination; and dynamically adjusting a power bias value based on at least one of the first and second counters, wherein the power bias value indicates a portion of a power budget to be allocated to one of the core domain and the graphics domain.
 15. The non-transitory machine-readable medium of claim 14, wherein the method further comprises providing the target frequency to a power controller of the processor, wherein the power controller is to limit a frequency of the core domain to the target frequency.
 16. The non-transitory machine-readable medium of claim 14, wherein the method further comprises performing the target frequency determining and the updating of at least one of the first and second counters for each graphics frame rendered by the graphics domain during a first time period, and adjusting the power bias value after the first time period.